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Abstract — Gaussian processes (GPs) are versatile tools 
that have been successfully employed to solve nonlinear 
estimation problems in machine learning, but that are 
rarely used in signal processing. In this tutorial, we present 
GPs for regression as a natural nonlinear extension to 
optimal Wiener filtering. After establishing their basic 
formulation, we discuss several important aspects and 
extensions, including recursive and adaptive algorithms 
for dealing with non-stationarity, low-complexity solutions, 
non-Gaussian noise models and classification scenarios. 
Furthermore, we provide a selection of relevant applica- 
tions to wireless digital communications. 



I. Introduction 

Gaussian processes (GPs) are Bayesian state-of-the-art 
tools for discriminative machine learning, i.e., regression 
(TJ, classification [2] and dimensionality reduction (3j. 
GPs were first proposed in statistics by Tony O'Hagan 
Q and they are well-known to the geostatistics commu- 
nity as kriging. However, due to their high computational 
complexity they did not become widely applied tools in 
machine learning until the early XXI century |5j. GPs 
can be interpreted as a family of kernel methods with the 
additional advantage of providing a full conditional sta- 
tistical description for the predicted variable, which can 
be primarily used to establish confidence intervals and to 
set hyper-parameters. In a nutshell, Gaussian processes 
assume that a Gaussian process prior governs the set of 
possible latent functions (which are unobserved), and the 
likelihood (of the latent function) and observations shape 
this prior to produce posterior probabilistic estimates. 
Consequently, the joint distribution of training and test 
data is a multidimensional Gaussian and the predicted 
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distribution is estimated by conditioning on the training 
data. 

While GPs are well-established tools in machine 
learning, they are not as widely used by the signal 
processing community as neural networks or support 
vector machines (SVMs) are. In our opinion, there are 
several explanations for the limited use of GPs in signal 
processing problems. First, they do not have a simple 
intuition for classification problems. Second, their direct 
implementation is computationally demanding. Third, 
their plain vanilla version might seem uptight and not 
flexible enough. Fourth, to most signal processing ex- 
perts Gaussian process merely stands for a noise model 
and not for a flexible algorithm that they should be using. 

In this paper, we present an overview on Gaussian 
processes explained for and by signal processing prac- 
titioners. We introduce GPs as the natural nonlinear 
Bayesian extension to the linear minimum mean square 
error (MMSE) and Wiener filtering, which are central to 
many signal processing algorithms and applications. We 
believe that GPs provide the correct approach to solve an 
MMSE filter nonlinearly, because they naturally extend 
least squares to nonlinear solutions through the kernel 
trick; they use a simple yet flexible prior to control the 
nonlinearity; and, evidence sampling or maximization 
allows setting the hyper-parameters without overfltting. 
This last feature is most interesting: by avoiding cross- 
validation we are able to optimize over a larger number 
of hyperparameters, thus increasing the available kernel 
expressiveness. Additionally, GP provides a full statisti- 
cal description of its predictions. 

The tutorial is divided in three parts. We have summa- 
rized in Figure [T] the relationship between the regression 
techniques introduced throughout the different sections. 
In the first part, Section [n] provides a detailed overview 
of Gaussian processes for regression (GPR) [1]. We 
show that they are the natural nonlinear extension to 
MMSE/Wiener filtering and how they can be solved 
recursively. The second part of the paper focuses briefly 
on several key aspects of GP-based techniques. Consec- 
utively, we review solutions to adjust the kernel function 



(Section III ), to tame the computational complexity 
of GPs (Section [IV] ), and to deal with non-Gaussian 
noise models (Section [V] ). In the third part, we cover 
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Fig. 1: Relationship between the regression techniques 
discussed in this tutorial. 



additional extensions of interest to signal processing 
practitioners, in particular dealing with non-stationary 
scenarios (Section VI ) and classification problems (Sec- 
tion 



VII ). We illustrate them with relevant examples 



in signal processing for wireless communications. We 
conclude the paper with a discussion. 

II. Gaussian Processes for Machine Learning 

A. Minimum mean square error: a starting point 

GPs can be introduced in a number of ways and we, 
as signal processing practitioners, find it particularly ap- 
pealing to start from the MMSE solution. This is because 
the Wiener solution, which is obtained by minimizing the 
MSE criterion, is our first approach to most estimation 
problems and, as we show, GPs are its natural Bayesian 
extension. 

Many signal processing problems reduce to estimating 
from an observed random process x G W another related 
process y G R. These two processes are related by a 
probabilistic, possibly unknown, model p(x\y). It is well 
known that the unconstrained MMSE estimate, 



ar gmin E 

/(x) 



|y-/(x) 



(i) 



coincides with the conditional mean estimate of y given 



x 



X 



E[y\x\ 



yp(y\x)dy = y 



p(*\y)p{y) 



Furthermore, these expectations can be easily estimated, 
using the sample mean, from independently and iden- 
tically distributed (iid) samples drawn from p(x|y) and 
p(y), namely V n = {x.i,yi}? =v 

However, if x is not linearly related to y (plus 
Gaussian noise) or y is not Gaussian distributed, the 
conditional estimate of y given x is no longer linear. 
Computing the nonlinear conditional mean estimate in 
Q directly from T> n either leads to overfitted solutions, 
because there are no convergence guarantees for general 
density estimation [6], or to suboptimal solutions, if we 
restrict the density model to come from a narrow class 
of distributions. For instance, in channel equalization, 
although suboptimal, the sampled version of ([3]) is used 
due to its simplicity. One viable solution would be to 
minimize the sampled version of ([T]) with a restricted 
family of approximating functions to avoid overfitting. 
Kernel least squares (KLS) [7] and Gaussian process 
regression, among others, follow such approach. 

B. Gaussian Processes for Regression 

In its simplest form, GPR models the output nonlin- 
early according to 



y = /( x ) + ^ 



(4) 



and it follows ([T]), without assuming that x and y sue 
linearly related or that p(y) is Gaussian distributed. 
Nevertheless, it still considers that p(y|x) is Gaussian 
distributed, i.e., v is a zero-mean GaussiarQ In this way, 
GP can be understood as a natural nonlinear extension 
to MMSE estimation. Additionally, GPR does not only 
estimate ([2]) from V n , but it also provides a full statistical 
description of y given x, namely 



p(y|x,£V, 



(5) 



p(x) 



If p(y,x.) is jointly Gaussian, i.e. p{y) and p(x\y) are 
Gaussians and -E[x|y] is linear in y, this solution is linear. 
If y and x are zero mean, the solution yields i£[y|x] = 
w T x, where 



w 



wtnse = argminS 

w 



y 



wx 



E 



XX 



GPs can be presented as a nonlinear regressor that 
expresses the input-output relation in ((4]) by assum- 
ing that a real- valued function /(x), known as latent 
function, underlies the regression problem and that this 
function follows a Gaussian process. Before the la- 
bels are revealed, we assume this latent function has 
dy been drawn from a Gaussian process prior. GPs are 
characterized by their mean and covariance functions, 
denoted by /i(x) and fc(x, x'), respectively. Even though 
nonzero mean priors might be of use, working with zero- 
mean priors typically represents a reasonable assumption 
and it simplifies the notation. The covariance function 
explains the correlation between each pair of points 
1 E [x$l tne i n P ut space and characterizes the functions that 

(3) 'A further relaxation to this condition is discussed in Section [vj 



(2) 
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can be described by the Gaussian process. For example, 
fc(x, x') = x T x' only yields linear latent functions and it 
is used to solve Bayesian linear regression problems, for 
which the mean of the posterior process coincides with 
the MMSE solution in ([3]), as shown in Section |H-El We 



cover the design of covariance functions in Section III 



For any finite set of inputs V n , a Gaussian process 
becomes a multidimensional Gaussian defined by its 
mean (zero in our case) and covariance matrix, (K n )jj = 
k(xi,Xj), Vxj,Xj G V n . The Gaussian process prior 

p(f n \X n )=X(0,K n ), (6) 



becomes 



where f n = [/(xi), /(x 2 ), . . . , /(x n )] T and X n = 
[xi , X2 , . . . , x n ] . We want to compute the estimate for 
a general input x, when the labels for the n training 
examples, denoted by y n = [yi, y%, ■ ■ ■ ,y n ] , are known. 
We can analytically compute Q by using the standard 
tools of Bayesian statistics: Bayes' rule, marginalization 
and conditioning. 

We first apply Bayes' rule to obtain the posterior 
density for the latent function 



p(/(x),f n |x,X> n ) 



P(yJ f n)p(/(x),fn|x,Xr, 



(7) 



where p(/(x), f n |x, X n ) is the Gaussian process prior 
in Q extended with a general input x, p(y n \f n ) is the 
likelihood for the latent function at the training set, in 
which y n is independent of X n given the latent function 
f n , and p(y n \~K n ) is the marginal likelihood or evidence 
of the model. 

The likelihood function is given by a factorized model: 



P(yj f n) = Ylp(yi\f(xi)), 



(8) 



because the samples in V n are iid. In turn, for each pair 
(/(xj),yj), the likelihood is given by Q, therefore 



p( m |/(x i ))~^(/(x i ),0. 



(9) 



A Gaussian likelihood function is conjugate to the 
Gaussian prior and hence the posterior in (|7]) is also a 
multidimensional Gaussian, which simplifies the compu- 
tations to obtain Q. If the observation model were not 
Gaussian, warped Gaussian processes (see Section [V] ) 
could be used to estimate Q. 

Finally, we can obtain the posterior density in ([5]) for 
a general input x by conditioning on the training set and 
x, and by marginalizing the latent function: 

p(y|x,D n )=/p( ! /|/(x))p(/(x)|x,D n )4f(x), (10) 



wherd3 

p(/(x)p n ,x) = J p(/(x),f n |x,P n )df n . (11) 

We have divided the marginalization in two separate 
equations to show the marginalization of the latent func- 



tion over the training set in ( 1 1 ), and the marginalization 



of the latent function at a general input x in ( |T0| ). As 
mentioned earlier, the likelihood and the prior are Gaus- 



sians and therefore the marginalization in ( fT0| ) and (jTT 



only involves Gaussian distributions. Thereby, we can 
analytically compute (10 1 and ( fTl) by using Gaussian 
conditioning and marginalization properties, leading to 
the following Gaussian density for the output: 



2 



where 



a 



/(x) 



k(x, x) 



k'C^k, 



with 



k = [fc(xi,x), fe(x 2 , x), . . . , fc(x n , x)] 

-'n — K n + (T I n . 



(12) 



(13a) 
(13b) 



(14) 
(15) 



The mean for p(y|x, T> n ) is also given by ( |13a ), i.e., 
H y = A*/( x )> an d its variance is 



f7 /(x)+ (J ^ 



(16) 



which, as expected, also accounts for the noise in the 
observation model. 



The mean prediction of GPR in ( |13a| ) is the solution 
provided by KLS, or kernel ridge regression (KRR) (7J, 
in which the covariance function takes the place of the 
kernel. However, unlike standard kernel methods, GPR 
provides error bars for each estimate in ( 13b I or ( fT6] l and 
has a natural procedure for setting the covariance/kernel 
by evidence sampling or maximization, as detailed in 



Section III In SVM or KRR the hyper-parameters are 
typically adjusted by cross-validation, needing to retrain 
the models for different settings of the hyper-parameters 
on a grid search. So, typically only one or two hyper- 
parameters can be fitted. GPs can actually learn tens 
of hyper-parameter, because either sampling or evidence 
maximization allows setting them by a hassle-free pro- 
cedure. 



2 Given the training data set, f„ takes values in '. 
of n samples of a Gaussian process. 



as it is a vector 
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Fig. 2: Example of a Gaussian process posterior in ( \l2y 
with 20 training samples, denoted by green +. Five 
instances of the posterior are plotted by thin blue lines 
and the mean of the posterior, fi y , by a red thick line. 
The shaded area denotes the error bars for the mean 
prediction: \i y ± 2a y . 



C. An example 

In Fig. [2] we include an illustrative example with 
20 training points, in which we depict §V2\ for any x 
between —3 and 4. We used standard functions from the 
GPML toolbox, available at http://www.gaussianprocess. 
org/gpmlA to generate the GP in this figure. We have 
chosen a Gaussian kernel that is fixecj^] as /c(xj,Xj) = 
exp (— 2||xj — Xj|| 2 ) and a v = 0.1. In the plot, we show 
the mean of the process in red and the shaded area 
denotes the error bar for each prediction, i.e., \i y ± 2a y . 
We also plot 5 samples from the posterior in thin blue 
lines. 

We observe three different regions in the figure. On 
the right-hand side, we do not have samples and, for 
x > 3, the GPR provides the solution given by the prior 
(zero mean and ±2). At the center, where most of the 
data points lie, we have a very accurate view of the 
latent function with small error bars (close to ±2a u ). 
On the left hand side, we only have two samples and 
we notice the mixed effect of the prior widening the 
error bars and the data points constraining the values of 
the mean to lie close to the available samples. This is 
the typical behavior of GPR, which provides an accurate 
solution where the data lies and high error bars where we 
do not have available information and, consequently, we 
presume that the prediction in that area is not accurate. 



"The kernel is typically expressed in a parametric form, see Section 



D. Recursive GPs 

In many signal processing applications, the samples 
become available sequentially and estimation algorithms 
should obtain the new solution every time a new datum 
is received. In order to keep the computational complex- 
ity low, it is more interesting to perform inexpensive 
recursive updates rather than to recalculate the entire 
batch solution. Online Gaussian Processes (8j fulfill 
these requisites as follows. 

Let us assume that we have observed the first n 
samples and that at this point the new datum x n+ i is pro- 
vided. We can readily compute the predicted distribution 
for y n +i using ( |13a[ ), ( |13b ) and ( fT6] ). Furthermore, by 
using the formula for the inverse of a partitioned matrix 
and the Woodbury identity we update C~, 1 from C" 1 

Cn 1 + C„ 1 k n+ ik^ +1 C n 1 /(Ty ri+i — C n 1 k n+ i/cr^ +i 

(17) 

and k n+ i correspond to ( fT6] l and ( p"4| ), 
respectively, for x = x n+ i. 

Nevertheless, for online scenarios, it is more con- 
venient to update the predicted mean and covariance 
matrix for all the available samples, as it is easier to 
interpret how the prediction changes with each new 



J n+1 



where a^ n t 



datum. Additionally, as we will show in Section VI 
this formulation makes the adaptation to non-stationary 
scenarios straightforward. Let us denote by /z n and S n 
the posterior mean and covariance matrix for the samples 
in V n . By applying ( |13a| ) and ( |13b| ) we obtain 



K n c; 

K„ - 



KC *K 



(18a) 
(18b) 



Once the new datum (x n+ i,y n+ i) is observed, the 
updated mean and covariance matrix can be computed 
recursively as follows: 



J ri+1 



M/(x n+1 ) 



/V(x„+i) - Vn+l 



a 



Vn+l 



(7 



h n +i 

2 

/(x„+i). 



(19a) 





h„,+i 


1 


h„+i 


hn+1 


a /(x„ + 0. 


< +1 


a /(x„+i). 



h 



n+l °/(Xn+l) 



where h n+ i 



(19b) 

KnC^jkn+i. 



^wK n 1 k ra+ i — (I 
As can be observed in <\l9a) , the mean of the new 
process is obtained by applying a correction term to 
the previous mean, proportional to the estimation error, 
A t /(x„ +1 ) — Un+i- Because of the relation between S n 
and C" 1 stated in ( |18b[ ), only one of the two matrices 
needs to be stored and updated in an online formulation. 
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Some authors [8] prefer to rely on C n 1 , whereas others 
Q store and update S„. 



The recursive update of the mean in ( |19a| ) is equiva- 
lent to what is known as kernel recursive least-squares 
(KRLS) in the signal processing literature (see for in- 
stance |8j-[10| ). The unbounded growth of the involved 
matrices, visible in (fT9|) and (fTTl), is the main limitation 



in the KRLS formulation. Practical KRLS implementa- 
tions typically either limit this growth 1 10|11 1 or even 
fix the matrix sizes (12). Nevertheless, the solution 
of KRLS is limited to the mean only and it cannot 
estimate confidence intervals. By using a GP framework, 
though, an estimate of the entire posterior distribution is 



obtained, including the covariance in ( 19b). 



E. Connection to MMSE: GPR with a linear latent 
function 

If we replace /(x) in Q with a linear model 

y = w T x + v, 

the Gaussian process prior over /(x) becomes a 
spherical-Gaussian prior distribution over w, p(w) ~ 

We can now compute the posterior for w, as we did 
for the latent function in ([7]) 

p(y|X,w)p(w) _ p(w) 



p(w\V) 



p(y|X) 



p(yl x 



w 



where p(yj|xj,w) is the likelihood. Since the prior and 
likelihood are Gaussians, so it is the posterior, and its 
mean and covariance are given by 
1 



5] w 



,-EwX'y, 



X T X/^ + I/a 



(20a) 



(20b) 



We can readily notice that ( |20a| ) is the sampled version of 
([3]), when the prior variance tends to infinity (i.e., the 
prior has no effect of the solution). The precision matrix 
(the inverse covariance) is composed of two terms: the 
first depends on the data and the other one on the 
prior over w. The effect of the prior in the mean and 
covariance fades away, as we have more available data. 
The estimate for a general input x is computed as in fTO] ) 



P(y|x,2?)=/ p(y\x,w)p(w\V)dw, 



(21) 

which is a Gaussian distribution with mean and variance 
given by: 

1 



2 



a 



2 x'S w X'y 



a 



v 



x T E Y 



x + at 



(22) 
(23) 



Equations ( [22] ) and ( |23[ > can be, respectively, rewritten 
as ( |13a| ) and ([16]), if we use the inner product between 
the Xj multiplied by the width of the prior over w, i.e. 
the kernel matrix is given by: K n = X<r^IX T . The 
kernel matrix must include the width of the prior over 
w, because the kernel matrix represents the prior of 
the Gaussian process and o\ is the prior of the linear 
Bayesian estimator. By using the Woodbury's identity, it 
follows that 



S w = all - ailX T (all + K T 



1 Xall. (24) 



Now, by replacing ( |24[ ) in ( |22[ ) and ( |23[ ), we,respectively, 
recover ( |13a| ) and ( fT6[ ). These steps connect the esti- 
mation of a Bayesian linear model and the nonlinear 
estimation using a kernel or covariance function without 
needing to explicitly indicate the nonlinear mapping. 



III. COVARIANCE FUNCTIONS 

In the previous section, we have assumed that the 
covariance functions k(x,x') are known, which is not 
typically the case. In fact, the design of a good co- 
variance function is crucial for GPs to provide accurate 
nonlinear solutions. The covariance function plays the 
same role as the kernel function in SVMs or KLS 
[171. It describes the relation between the inputs and 
its form determines the possible solutions of the GPR. 
It controls how fast the function can change or how 
the samples in one part of the input space affect the 
latent function everywhere else. For most problems, we 
can specify a parametric kernel function that captures 
any available information about the problem at hand. 
As already discussed, unlike kernel methods, GPs can 
infer these parameters, the so-called hyper-parameters, 
from the samples in V n using the Bayesian framework. 
Instead of relying on computational intensive procedures 
as cross-validation [|T3| or learning the kernel matrix 



[ [14] , as kernel methods need to. 

The covariance function must be positive semi- 
definite, as it represents the covariance matrix of a 
multidimensional Gaussian distribution. The covariance 
can be built by adding simpler covariance matrices, 
weighted by a positive hyper-parameter, or by multi- 
plying them together, as the addition and multiplication 
of positive definite matrices yields a positive definite 
matrix. In general, the design of the kernel should rely 
on the information that we have for each estimation 
problem and should be designed to get the most accurate 
solution with the least amount of samples. Nevertheless, 



the following kernel in (25 1 often works well in signal 
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processing applications 



ai exp 



^2le\\xie ~ Xji\\ 2 \+a 2 xJxj+ac38ij^ 
e=i J 



solution can be found using gradient ascent techniques. 
See [5] for further details. 



(25) 

where = [ai,7i, 72, ■ • • , 7d, «2, a 3 ] T are the hyper- 
parameters. The first term is a radial basis kernel, also 
denoted as RBF or Gaussian, with a different length- 
scale for each input dimension. This term is universal 
and allows constructing a generic nonlinear regressor. If 
we have symmetries in our problem, we can use the same 
length-scale for all dimensions: jg = 7 for £ = 1, . . . , d. 
The second term is the linear covariance function. The 
last term represents the noise variance 0:3 = (J„, which 
can be treated as an additional hyper-parameter to be 
learned from the data. We can add other terms or other 
covariance functions that allow for faster transitions, like 
the Matern kernel among others (5J. 

If the hyper-parameters, 6, are unknown, the likeli- 
hood in ([8]) and the prior in ([6]) can, respectively, be ex- 



pressecyas p(y|f , 6) and p(f |X, 0), and we can proceed 
to integrate out 6 as we did for the latent function, f , in 
Section II-B First, we compute the marginal likelihood 
of the hyper-parameters of the kernel given the training 
dataset 

p(y|X,0)= /p(y|f,0)p(f|X,0)df. (26) 

Second, we can define a prior for the hyper-parameters, 
p(0), that can be used to construct its posterior. Third, 
we integrate out the hyper-parameters to obtain the 
predictions. However, in this case, the marginal likeli- 
hood does not have a conjugate prior and the posterior 
cannot be obtained in closed form. Hence, the integration 
has to be done either by sampling or approximations. 
Although this approach is well principled, it is compu- 
tational intensive and it may be not feasible for some 
applications. For example, Markov-Chain Monte Carlo 
(MCMC) methods require several hundred to several 
thousand samples from the posterior of 6 to integrate 
it out. Interested readers can find further details in (5J. 

Alternatively, we can maximize the marginal likeli- 
hood in ( [26] ) to obtain its optimal setting fil. Although 
setting the hyper-parameters by maximum likelihood 
(ML) is not a purely Bayesian solution, it is fairly 
standard in the community and it allows using Bayesian 
solutions in time sensitive applications. This optimiza- 
tion is nonconvex (T3J, but, as we increase the number 
of training samples, the likelihood becomes a unimodal 
distribution around the ML hyper-parameters and the 

4 We have dropped the subindex n, as it is inconsequential and 
unnecessarily clutters the notation. 



parse GPs: Dealing with large-scale 

DATA SETS 

To perform inference under any GP model, the inverse 
of the covariance matrix must be computed. This is a 
costly operation, C(n 3 ), that becomes prohibitive for 
large enough n. Given the ever-increasing availability 
of large-scale databases, a lot of effort has been devoted 
over the last decade to the development of approximate 
methods that allow inference in GPs to scale linearly 
with the number of data points. These approximate 
methods are referred to as "sparse GPs", since they 
approximate the full GP model using a finite-basis-set 
expansion. This set of bases is usually spawned by using 
a common functional form with different parametriza- 
tions. For instance, it is common to use bases of the type 
{fc(z&, x)}^, where {z^,}^ — known as the active 
set — is a subset of the input samples parametrizing the 
bases. 

Under the unifying framework of fl6] |, it can be shown 
that most relevant sparse GP proposals [ 17|18] , which 
were initially thought of as entirely different low-cost 
approximations, can be expressed as exact inference 
under different modifications of the original GP prior. 
This modified prior induces a rank-m (m -C n) covari- 
ance matrix — plus optional (block) diagonal correcting 
terms — , clarifying how the reduced 0{m 2 n) cost of 
exact inference arises. 

Among the mentioned approximations, the sparse 
pseudo-input GP (SPGP) [18] is generally regarded as 
the most efficient. Unlike other alternatives, it does not 
require the active set to be a subset of the training data. 
Instead, {z{,}^ 1 can be selected to lie anywhere in the 
input space, thus increasing the flexibility of the finite 
set expansion. This selection is typically performed by 
evidence maximization. An even more flexible option, 
which does not require the active set to even lie in the 
input domain, is presented in |19|. 

Despite the success of SPGP, it is worth mentioning 
that increasing the number of bases in this algorithm 
does not yield, in general, convergence to the full GP 
solution because the active set {zi ) }^! =1 is not constrained 
to be a subset of input data. This might lead to overfiting 
in some pathological cases. A recent variational sparse 
GP proposal that guarantees convergence to the full 
GP solution while still allowing the active set to be 
unconstrained is presented in [20]. 

Further approaches yielding reduced computational 
cost involve numerical approximations to accelerate 
matrix-vector multiplications and compactly supported 
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covariance functions which set most entries of the co- 
variance matrix to zero ]2T) . 

Sparsity is often seen in online signal processing in the 
form of pruning, which restricts the active set to a subset 
of input data. The success of SPGP and its variational 
counterpart suggests that advanced forms of pruning may 
result in increased efficiency for a given sparsity level. 

V. Warped GPs: Beyond the standard noise 
model 

Even though GPs are very flexible priors for the latent 
function, they might not be suitable to model all types 
of data. It is often the case that applying a logarithmic 
transformation to the target variable of some regression 
task (e.g., those involving stock prices, measured sound 
power, etc) can enhance the ability of GPs to model it. 



In [22 1 it is shown that it is possible to include 
a non-linear preprocessing of output data h{y) (called 
warping function in this context) as part of the modeling 
process and learn it. In more detail, a parametric form 
for z = h(y) is selected, then z (which depends on the 
parameters of h(y)) is regarded as a GP, and finally, 
the parameters of h{y) are selected by maximizing 
the evidence of such GP (i.e., a ML approach). The 
authors suggest using h(y) = Ya=i a i tanh(6j(y + a)) 
as the parametric form of the warping function, but any 
option resulting in a monotonic function is valid. A non- 
parametric version of warped GPs using a variational 



approximation has been proposed in \23\. 



VI. Tracking non- stationary scenarios: 
Learning to forget 

KRLS algorithms, discussed in Section [Tl-D[ tradition- 
ally consider that the mapping function /(•) is constant 
throughout the whole learning process |10|24| . However, 
in the signal processing domain this function (which 
might represent, for instance, a fading channel) is often 
subject to changes and the model must account for 
this non-stationarity. Some kernel-based algorithms have 
been proposed to deal with non-stationary scenarios. 
They include a kernelized version of the extended RLS 



filter 1 24 1, a sliding-window KRLS approach [12| and a 
family of projection-based algorithms ]25|26| . 

In order to add adaptivity to the online GP algorithm 
described in Section |II-D[ it is necessary to make it 



"forget" the information contained in old samples. This 
becomes possible by including a "forgetting" step after 
each update 

\jl <— y/Xfj, (27a) 
£ <- AS + (1 - A)K. (27b) 



to shift the posterior distribution towards the prior (for 
< A < 1), thus effectively reducing the influence of 
older samples. Note that when using this formulation 
there is no need to store or update C _1 , see [9] for fur- 
ther details. The adaptive, GP-based algorithm obtained 
in this manner is known as KRLS-T. 



Equations ( |27| ) might seem like an ad-hoc step to 
enable forgetting. However, it can be shown that the 
whole learning procedure — including the mentioned 
controlled forgetting step — corresponds exactly to a 
principled non-stationary scheme within the GP frame- 
work, as described in [27]. It is sufficient to consider an 
augmented input space that includes the time stamp t 
of each sample and define a spatio-temporal covariance 
function: 



k st ([t x r ] r ,[t' x' r ] r ) = MM%(x,x') 



(28) 



where /c s (x, x') is the already-known spatial covariance 
function and k t (t,t') is a temporal covariance function 
giving more weight to samples that are closer in time. 
Inference on this augmented model effectively accounts 
for non-stationarity in /(•) and recent samples have 
more impact in predictions for the current time instant. 
It is fairly simple to include this augmented model 
in the online learning process described in the previ- 
ous section. When the temporal covariance is set to 
k t (t, t') = A^~ , A € (0, 1], inference in the augmented 
spatio-temporal GP model is exactly equivalent to using 



p7[ ) after each update ( |T9| ) in the algorithm of Section 



II-D which has the added benefit of being inexpensive 



and online. See [9|27|281 for further details. 

Observe that A is used here to model the speed at 
which /(•) varies, playing a similar role to that of the 
forgetting factor in linear adaptive filtering algorithms. 
When used with a linear spatial covariance, the above 
model reduces to linear extended RLS filtering. The 
selection of this parameter is usually rather ad-hoc. 
However, using the GP framework, we can select it in a 
principled manner using Type-II ML, see [27]. 

In Fig. [3] we take the example of Fig. [2] and we apply 
a forgetting factor A = 0.8. The red continuous line 
indicates the original mean function before forgetting. 
After applying one forgetting update, this mean function 
is displaced toward zero, as indicated by the the blue 
dashed line. The shaded gray area represents the error 
bars prior to forgetting. The forgetting update expands 
this area into the shaded red area, which tends to the 
prior variance of 1. 
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Fig. 3: Illustration of forgetting step ( p7| ) on the GP of 
Fig. [2j the dashed line represents the predictive mean 
that is pulled towards the prior mean, while the shaded 
red area represents the region [i y ± 2a y after forgetting. 



A. Tracking a time-selective nonlinear communication 
channel 

To illustrate the validity of the adaptive filtering algo- 
rithm, we focus on the problem of tracking a nonlinear 
Rayleigh fading channel (29] Chapter 7]. The used 
model consists of a memoryless saturating nonlinearity 
followed by a time-varying linear channel, as shown in 
Fig. |4| This model appears for instance in broadcast 
or satellite communications when the amplifier operates 
close to saturation regime (30). 

In a first, simulated setup, the time-varying linear 
fading channel consists of 5 randomly generated paths, 
and the saturating nonlinearity is chosen as y = tanh(x). 
We fix the symbol rate at T = 1/is, and we simulate two 
scenarios: one with a normalized Doppler frequency of 
fdT = 10 -4 (where fd denotes the Doppler spread), 
representing a slow-fading channel, and another one 
with fdT = 10~ 3 , corresponding to a fast time-varying 
channel. Note that a higher Doppler frequency yields 
a more difficult tracking problem, as it corresponds to 
a channel that changes faster in time. We consider a 
Gaussian source signal, and we add 30 dB of additive 
white Gaussian noise to the output signal. Given one 
input-output data pair per time instant, the tracking 
problem consists in estimating the received signal that 
corresponds to a new channel input. 

Figs. (5ja,b) illustrate the tracking results obtained by 
KRLS-T in these scenarios. As a reference, we include 
the performance of several state-of-the-art adaptive filter- 
ing algorithms, whose Matlab implementations are taken 
from the Kernel Adaptive Filtering Toolbox, available at 



Source 


Nonlinearity 




Time-Varying 






Linear Channel 



Received 
Signal 



Fig. 4: The nonlinear channel used in the example 
consists of a nonlinearity followed by a linear channel. 



compare KRLS-T with normalized least mean squares 
(NLMS), extended RLS (EX-RLS), both of which are 



linear algorithms, see |29|, and quantized kernel LMS 
(QKLMS) (3TJ, which is an efficient, kernelized version 
of the LMS algorithm. A Gaussian kernel /c(xj,Xj) = 
exp(-7||x» - XjH 2 ) is used for QKLMS and KRLS-T. 
In each scenario the optimal hyperparameters of KRLS- 
T are obtained by performing Type-II ML optimization 



(see Section III ) on a separate data set of 500 test 
samples. The optimal parameters of the other algorithms 
are obtained by performing cross-validation on the test 
data set. To avoid an unbounded growth of the matrices 
involved in KRLS-T, its memory is limited to 100 bases 
which are selected by pruning the least relevant bases 
(see (9j for details on the pruning mechanism). The 
quantization parameter of QKLMS is set to yield similar 
memory sizes. As can be seen in Figs. [5Ja,b), KRLS- 
T outperforms the other algorithms with a significant 
margin in both scenarios. By being kernel-based it is 
capable to deal with nonlinear identification problems, in 
contrast to the classical EX-RLS and NLMS algorithms. 
Furthermore, it shows excellent convergence speed and 
steady-state performance when compared to QKLMS. 
Additional experimental comparisons to other kernel 
adaptive filters can be found in [9]. 

In a second setup we used a wireless communication 
test bed that allows to evaluate the performance of 
digital communication systems in realistic indoor envi- 
ronments. This platform is composed of several transmit 
and receive nodes, each one including a radio-frequency 
front-end and baseband hardware for signal generation 
and acquisition. The front-end also incorporates a pro- 
grammable variable attenuator to control the transmit 
power value and therefore the signal saturation. A more 



http://sourceforge.net/projects/kafbox/ In particular, we 



detailed description of the test bed can be found in [32 1. 
Using the hardware platform, we reproduced the model 
corresponding to Fig. |4] by transmitting clipped orthog- 
onal frequency-division multiplexing (OFDM) signals 
centered at 5.4 GHz over real frequency-selective and 
time-varying channels. Notice that, unlike the simulated 
setup, several parameters such as the noise level and 
the variation of the channel coefficients are unknown. 
To have an idea about the channel characteristics, we 
first measured the indoor channel using the procedure 
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Fig. 5: Tracking results on a nonlinear Rayleigh fading channel: (a) simulation results for a slow-fading scenario; 
(b) simulation results for a fast time-varying scenario; (c) tracking results on data measured on the test bed with 
fast time-varying channels; (d) channel taps of the noisy linear channels, measured on the test bed setup. 



TABLE I: Steady-state NMSE performance for Fig. [5] 





NLMS 


EX-RLS 


QKLMS 


KRLS-T 


f d T = 10~\ simulated 
f d T = 1CT 3 , simulated 
f d T = 1CT 3 , real data 


-13.3 dB 
-10.6 dB 
-11.5 dB 


-13.0 dB 
-11.0 dB 
-12.5 dB 


-14.6 dB 
-9.9 dB 
-15.8 dB 


-22.3 dB 
-15.3 dB 
-21.3 dB 



described in [32]. As an example, the variation of the 
four main channel coefficients is depicted in Fig.[5jd), in- 
dicating a normalized Doppler frequency around f d T = 
10~ 3 . We then transmitted periodically OFDM signals 
with the transmit amplifier operating close to saturation 
and acquired the received signals. The transmitted and 
received signals were used to track the nonlinear channel 
variations as in the simulated setup. The results, shown 
in Fig. |5jc), are similar to those of the simulated setup. 
Finally, the steady-state NMSE performances of all three 
scenarios, Figs. [5Ja,b,c), are summarized in Table [TJ 

VII. Gaussian Processes for Classification 

For classification problems, the labels are drawn from 
a finite set and GPs return a probabilistic prediction 
for each label in the finite set, i.e., how certain is the 



classifier about its prediction. In this tutorial, we limit our 
presentation of GPs for classification (GPC) for binary 
classification problems, i.e., y$ £ {0, 1}. For GPC, we 
change the likelihood model for the latent function at x 
using a response function $(•): 



p(y = l|/(x)) = $(/(x)). 



(29) 



The response function "squashes" the real-valued latent 
function to an (0, l)-interval that represents the posterior 
probability for y [5]. Standard choices for the response 
function are 3>(a) = l/(l+exp(— a)) and the cumulative 
density function of a standard normal distribution, used 
in logistic and probit regression respectively. 

The integrals in ( [T0| ) and ( fTT| ) are now analytically 
intractable, because the likelihood and the prior are not 
conjugated. Therefore, we have to resort to numerical 
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methods or approximations to solve them. The posterior 
distribution in ([7]) is typically single-mode and the stan- 
dard methods approximate it with a Gaussian J5j. Using 
a Gaussian approximation for ^ allows exact marginal- 



ization in ( 1 1 1 and we can use numerical integration for 



solving ([TO]), as it involves marginalizing a single real- 
valued quantity. The two standard approximations are the 
Laplace method or expectation propagation (EP) | [33| . In 
0, EP is shown to be a more accurate approximation. 

A. Probabilistic channel equalization 

GPC predictive performance is similar to other non- 
linear discriminative methods, such as SVMs. However, 
if the probabilistic output is of importance, then GPC 
outperforms other kernel algorithms, because it naturally 
incorporates the confidence interval in its predictions. In 
digital communication, channel decoders follow equal- 
izers, which work optimally when accurate posterior 
estimates are given for each symbol. To illustrate that 
GPC provide accurate posterior probability estimates, 
we equalize a dispersive channel model like the one in 
Fig. |4] using GPC and SVM with a probabilistic output. 
These outputs are subsequently fed to a low-density 
parity-check (LDPC) belief-propagation based channel 
decoder to assess the quality of the estimated posterior 
probabilities. Details for the experimental set up can 
be found in [34[ in which linear and nonlinear channel 
models are tested. We now summarize the results for the 
linear channel model in that paper. 

In Fig. [6] we depict the posterior probability estimates 
versus the true posterior probability, in (a) for the GPC- 
based equalizer and in (b) for SVM-based equalizer, to 
emphasize the differences between the equalizers we use 
a highly noisy scenario with normalized signal-to-noise 
ratio of 2 dB. If we threshold at 0.5, both equalizers 
provide similar error rates and we cannot tell if there 
is an advantage from using GPC. However, if we con- 
sider the whole probability space, GPC predictions are 
significantly closer to the main diagonal that represents 
a perfect match, hence GPC provides more accurate 
predictions to the channel decoder. 

To further quantify the gain from using a GPC-based 
equalizer with accurate posterior probability estimates, 
we plot the bit error rate (BER) in Fig. [JJ after the 
probabilistic channel encoder, in which the GPC-based 
equalizer clearly outperforms the SVM-based equalizer 
and is close to the optimal solution (known channel 
and forward-backward (BCJR) equalizer). This example 
is illustrative of the results that can be expected from 
GPC when a probabilistic output is needed to perform 
optimally. 




- GPC-LDPC 
■SVM-Platt-LDPC 
■ BCJR-LDPC 



2 3 

Eb/No (dB) 



Fig. 7: GPC and SVM as probabilistic channel equalizer 
in channel LDPC decoding: BER for the GPC-LDPC 
(V), the SVM-LDPC (o) and the optimal solution (o). 



VIII. Discussion 

In this tutorial, we have presented Gaussian Processes 
for Regression in detail from the point of view of 
MMSE/Wiener filtering, so it is amenable to signal 
processing practitioners. GPR provides the same mean 
estimate as KLS or KRR for the same kernel matrix. 
On the plus side, GPR provides error bars that take into 
account the approximation error and the error from the 
likelihood model, so we know the uncertainty of our 
model for any input point (see Fig. [2] ), while KLS as- 
sumes the error bars are given by the likelihood function 
(i.e., constant for the whole input space). Additionally, 
GPR naturally allows computing the hyper-parameters 
of the kernel function by sampling or maximizing the 
marginal likelihood, being able to set tens of hyper- 
parameters, while KLS or SVM need to rely on cross- 
validation, in which only one or two parameters can be 
easily tuned. On the minus side, the GP prior imposes 
a strong assumption on the error bars that might not be 
accurate, if the latent variable model does not follow a 
Gaussian process. Although, in any case, it is better than 
not having error bars. 

We have also shown that some of the limitations of 
the standard GPR can be eased. GPs can be extended 
to non-Gaussian noise models and classification prob- 
lems, in which GPC provides an accurate a posteriori 
probability estimate. The computational complexity of 
GPs can be reduced considerably, from cubic to linear 
in the number of training examples, without significantly 
affecting the mean and error bars prediction. Finally, we 
have shown the GP can be solved iteratively, with an 
RLS formulation that can be adapted to non-stationary 
environments efficiently. 
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Fig. 6: GPC as probabilistic channel equalizer: (a) calibration curve for the GPC and (b) calibration curve for the 



Instead of covering more methods and applications in 
detail, our intention was to provide a tutorial paper on 
how to use GPs in signal processing, with a number 
of illustrative examples. Nevertheless, since we assume 
that there are several other methods and applications that 
are relevant to the reader, we finish with a brief list of 
further topics. In particular, GPs have also been applied 
to problems including modeling human motion |35| , 
source separation [36], estimating chlorophyll concentra- 



tion [37], approximating stochastic differential equations 



[38 1 and multi-user detection |39|, among others. 
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